We all want to write good code

But good code doesn’t just magically happen


Despite many researchers using code, very few are trained in programming.
This increases the chance for messy or broken code in science.

Supporting good coding practices at the ALA

How to write good code
for science



Example code

# bad example
library(tidyverse)
library(galah)
library(janitor)
library(sf)
library(ozmaps)

galah_config(email="dax.kellie@csiro.au")
alaData<-galah_call()|>identify("perameles")|>filter(year==2003) |>
  select(group="basic",cl22)|>
  atlas_occurrences()|>
  select(recordID,scientificName,decimalLongitude,decimalLatitude,eventDate,cl22)|>
  janitor::clean_names()|>rename(state=cl22)|>
  mutate(event_date=lubridate::ymd(event_date))|>
  group_by(state)|>count()|>drop_na()
aus<-ozmaps::ozmap_states|>
  sf::st_transform(crs=4326)|>
  left_join(alaData, join_by(NAME==state))|>
  replace_na(list(n=0))

ggplot()+sf::geom_sf(data=aus,aes(fill=n),colour="grey60")+
  viridis::scale_fill_viridis(option="F",begin=0.2,direction=-1)+theme_void()+theme(legend.position="right")

Code style

  • Write code like you write sentences, add spaces
# better example
library(tidyverse)
library(galah)
library(janitor)
library(sf)
library(ozmaps)

galah_config(email = "dax.kellie@csiro.au")
alaData <- galah_call() |> identify("perameles") |> filter(year == 2003) |>
  select(group = "basic", cl22)|>
  atlas_occurrences() |>
  select(recordID, scientificName, decimalLongitude, decimalLatitude, eventDate, cl22) |>
  janitor::clean_names() |> rename(state = cl22) |>
  mutate(event_date = lubridate::ymd(event_date)) |>
  group_by(state) |> count() |> drop_na()
aus <- ozmaps::ozmap_states |>
  sf::st_transform(crs = 4326) |>
  left_join(alaData, join_by(NAME == state)) |>
  replace_na(list(n = 0))

ggplot() + sf::geom_sf(data = aus,aes(fill = n),colour = "grey60") +
  viridis::scale_fill_viridis(option = "F", begin = 0.2, direction = -1) + theme_void() + theme(legend.position = "right")

Code style

  • Break tasks into smaller code chunks
  • Line breaks allow code to breathe
# best example
library(tidyverse)
library(galah)
library(janitor)
library(sf)
library(ozmaps)

galah_config(email = "dax.kellie@csiro.au")

## Data

aus <- ozmap_states |>
  st_transform(crs = 4326) # fix projection

bandicoots <- galah_call() |>
  identify("perameles") |>
  filter(year == 2003) |>
  select(group = "basic", cl22) |>
  atlas_occurrences() 

bandicoots_clean <- bandicoots |>
  select(recordID, scientificName, decimalLongitude, 
         decimalLatitude, eventDate, cl22) |>
  janitor::clean_names() |> 
  rename(state = cl22) |>
  mutate(
    event_date = lubridate::ymd(event_date)
    ) 

state_counts <- bandicoots_clean |>
  group_by(state) |> 
  count() |> 
  drop_na()

aus_counts <- aus |>
  left_join(state_counts, join_by(NAME == state)) |>
  replace_na(list(n = 0))

## Map

ggplot() + 
  geom_sf(data = aus_counts,
          aes(fill = n),
          colour = "grey60") +
  viridis::scale_fill_viridis(option = "F", 
                              begin = 0.2, 
                              direction = -1) + 
  theme_void() + 
  theme(legend.position = "right")

Naming things

Bad

Objects

DT_3_final # unclear
map1       # vague
mapOfMarineInvertsFilteredByYear # too long


Functions

func1() # unclear
filter() # unclear, already used
remove_outlier_if_out_of_range # too long
Good

Objects

cockatoos_grouped
map_inverts
map_inverts_marine


Functions

get_counts() 
remove_outliers() 
has_outlier()

Naming things

Datset variable names

  • Variable names align with terms in final paper
  • Define variables in a README or in-text section

Example:

site_id = Unique site ID
individual_id = Unique ID of each individual
scientific_name = genus and species name
group = Unique group ID (control, high_condition, low_condition)
leaf_length = Length of leaf (mm)
leaf_width = Width of leaf (mm)
leaf_mass_per_area = Leaf dry mass per area, ratio of leaf size to density (g/m2)
temp_max = Maximum temperature, 2023 (C°)
soil_n = Soil nitrogen (mg/kg)

site_id individual_id scientific_name group leaf_length leaf_width leaf_mass_per_area temp_max soil_n
AAC101 B-31 Acacia acuminata control 70.4 3.3 105.66 42.23 25.67
AAC101 B-32 Acacia acuminata high_condition 84.0 2.4 102.26 42.23 25.76
ABB011 L-12 Acacia bidwillii low_condition 8.2 4.2 83.98 43.04 37.34

Code comments

  • Short but informative
  • Too many long comments is a good indicator that code is unclear (our example from earlier needs minimal notes)
# best example
library(tidyverse)
library(galah)
library(janitor)
library(sf)
library(ozmaps)

galah_config(email = "dax.kellie@csiro.au")

## Data

aus <- ozmap_states |>
  st_transform(crs = 4326) # fix projection

bandicoots <- galah_call() |>
  identify("perameles") |>
  filter(year == 2003) |>
  select(group = "basic", cl22) |>
  atlas_occurrences() 

bandicoots_clean <- bandicoots |>
  select(recordID, scientificName, decimalLongitude, 
         decimalLatitude, eventDate, cl22) |>
  janitor::clean_names() |> 
  rename(state = cl22) |>
  mutate(
    event_date = lubridate::ymd(event_date)
    ) 

state_counts <- bandicoots_clean |>
  group_by(state) |> 
  count() |> 
  drop_na()

aus_counts <- aus |>
  left_join(state_counts, join_by(NAME == state)) |>
  replace_na(list(n = 0))

## Map

ggplot() + 
  geom_sf(data = aus_counts,
          aes(fill = n),
          colour = "grey60") +
  viridis::scale_fill_viridis(option = "F", 
                              begin = 0.2, 
                              direction = -1) + 
  theme_void() + 
  theme(legend.position = "right")

Code comments

  • Use long comments to discuss why you did something
    • e.g., Interpret results of a statistical test

Example:

# Variables bio_12 and bio_16 are very highly correlated (94%). 
# To avoid multicollinearity, we’ll include only bio_12 and bio_4

Variables + Functions

  • Sometimes you can’t avoid complex code

What if you are a busy scientist short on time?

  • Making code understandable is more important than making it brief or fancy

Summary

Clean and readable code

  • Write code like you write sentences
    • Add spaces between operators (e.g. +, -, *)
    • Add spaces after commas
  • Break tasks into smaller code chunks
  • Line breaks allow code to breathe

Naming things

  • Simple but understandable names, avoid acronyms
  • Make sure variable names align with terms in final paper
  • Define variables in a README or in-text section

Code comments

Variables + Functions

When you’re busy, prioritise readability!